-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Implement Shard and CoordinatorLog metadata durability #4360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
src/java/org/apache/cassandra/replication/MutationTrackingService.java
Outdated
Show resolved
Hide resolved
localWitnessed = Offsets.Mutable.copy(witnessedOffsets.get(localNodeId)); | ||
|
||
witnessedOffsets.convertToPrimitiveMap(witnessed); | ||
persistedOffsets.convertToPrimitiveMap(persisted); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to make this incremental in the future, since every witnessed offset will eventually be persisted and reconciled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this one. We must overwrite the entire value of each list in the system table because of the representation, but that shouldn't be a big deal - the size of each list should be roughly the same - the head, with not gaps, collapsed into one range, plus the slightly sparse tail. I don't see how to make it incremental in this context, or what it'd bring. I could easily be missing something though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I wasn't thinking about the frozen representation at the time. Makes sense as-is for now.
persistedOffsets
could grow for quite a while if a single replica is down and can't reconcile, so that's where I see this getting expensive in the future.
patch by Aleksey Yeschenko; reviewed by Abe Ratnofsky for CASSANDRA-20882
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Would be good to have a test extending org.apache.cassandra.fuzz.topology.TopologyMixupTestBase
to test bounces as well.
Broadcasting persisted offsets every minute feels infrequent to me. The more frequently we broadcast persisted offsets, the sooner we can mark SSTables as repaired and purge CoordinatorLogOffsets. We want to avoid doing too many compactions on an SSTable before we can mark it repaired, so maybe we determine persistence and broadcast timing based on the number of mutations that have been marked reconciled, more frequent when reconciliation is happening quickly.
for (int offset = iter.start(), end = iter.end(); offset <= end; offset++) | ||
{ | ||
ShortMutationId id = new ShortMutationId(witnessed.logId, offset); | ||
result.addForTesting(MutationJournal.instance.read(id)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid calling a *ForTesting method from a production path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That must've been a botched refactoring rename, clearly unintended. Thanks for catching it.
localWitnessed = Offsets.Mutable.copy(witnessedOffsets.get(localNodeId)); | ||
|
||
witnessedOffsets.convertToPrimitiveMap(witnessed); | ||
persistedOffsets.convertToPrimitiveMap(persisted); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I wasn't thinking about the frozen representation at the time. Makes sense as-is for now.
persistedOffsets
could grow for quite a while if a single replica is down and can't reconcile, so that's where I see this getting expensive in the future.
I must be missing something, but why would it grow? We aren't recording what's reconciled - we are recording what's been witnessed and written to the system table. |
Discussed out-loud - I had some concerns about our offsets representation causing trouble if we end up with sparse representations, but that's a niche case that will impact more than just the durability paths. We should measure the logical vs. physical size of offsets, or "dropped" mutation IDs, to see if that's a problem for any real workloads. Merge away. |
Committed as bd5a657, cheers. |
patch by Aleksey Yeschenko; reviewed by Abe Ratnofsky for CASSANDRA-20882